Topics Covered Min
Data Cleaning 10
Model Training 10
Intermediate Visualization 10
Bonus Highlight: Learn how to make memes with R! 10

Data Cleaning

Let’s load my Fitbit activity and sleep data for the last month.

activity <- read.csv("Activity.csv", stringsAsFactors = FALSE)
sleep <- read.csv("Sleep.csv", stringsAsFactors = FALSE)

Let’s take a look at the data.

str(activity)
## 'data.frame':    21 obs. of  10 variables:
##  $ Date                  : chr  "2019-09-01" "2019-09-02" "2019-09-03" "2019-09-04" ...
##  $ Calories.Burned       : chr  "2,237" "2,843" "2,299" "2,863" ...
##  $ Steps                 : chr  "2,774" "9,983" "6,033" "11,209" ...
##  $ Distance              : num  1.17 4.2 2.54 4.71 0.93 1.59 6.2 1.08 1.13 3.05 ...
##  $ Floors                : int  2 25 20 44 0 6 256 3 5 8 ...
##  $ Minutes.Sedentary     : chr  "678" "625" "841" "703" ...
##  $ Minutes.Lightly.Active: int  139 175 122 177 96 105 234 145 105 219 ...
##  $ Minutes.Fairly.Active : int  15 22 9 28 2 0 61 0 12 17 ...
##  $ Minutes.Very.Active   : int  6 52 20 57 11 0 165 0 1 15 ...
##  $ Activity.Calories     : chr  "687" "1,353" "753" "1,445" ...
str(sleep)
## 'data.frame':    20 obs. of  9 variables:
##  $ Start.Time          : chr  "2019-09-19 12:47AM" "2019-09-18 1:04AM" "2019-09-17 12:04AM" "2019-09-16 5:21AM" ...
##  $ End.Time            : chr  "2019-09-19 7:51AM" "2019-09-18 7:39AM" "2019-09-17 8:04AM" "2019-09-16 11:29AM" ...
##  $ Minutes.Asleep      : int  358 335 416 303 0 382 493 429 451 438 ...
##  $ Minutes.Awake       : int  66 60 64 65 0 27 67 45 68 71 ...
##  $ Number.of.Awakenings: int  27 18 38 24 0 27 37 29 34 24 ...
##  $ Time.in.Bed         : int  424 395 480 368 0 409 560 474 519 509 ...
##  $ Minutes.REM.Sleep   : chr  "70" "59" "66" "56" ...
##  $ Minutes.Light.Sleep : chr  "186" "188" "214" "197" ...
##  $ Minutes.Deep.Sleep  : chr  "102" "88" "136" "50" ...

Notice that several of the variables are stored as character strings (chr). This is a problem because R treats them as text rather than as dates or numbers.
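Here is a minimal sketch of why this matters, using a few of the strings from the str() output above. The vector names (raw, direct, cleaned) are just for illustration. Converting the text directly fails for any value with a thousands separator, while stripping the comma first works.

```r
# A few values copied from the str() output above
raw <- c("2,237", "2,843", "687")

# Direct conversion: strings containing a comma become NA (with a warning,
# which is suppressed here to keep the output tidy)
direct <- suppressWarnings(as.numeric(raw))

# Remove the thousands separator first, then convert
cleaned <- as.numeric(gsub(",", "", raw))

print(direct)
print(cleaned)
```

This is the same gsub() plus as.numeric() pattern used in the cleaning code that follows.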

Let’s clean up the data types.

#install.packages("dplyr")
#install.packages("lubridate")
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date

activity <- activity %>% mutate(
    Date = ymd(Date), 
    Calories.Burned = as.numeric(gsub(",", "", Calories.Burned)), 
    Steps = as.numeric(gsub(",", "", Steps)),
    Minutes.Sedentary = as.numeric(gsub(",", "", Minutes.Sedentary)),  
    Minutes.Lightly.Active = as.numeric(gsub(",", "", Minutes.Lightly.Active)),
    Minutes.Fairly.Active = as.numeric(gsub(",", "", Minutes.Fairly.Active)),
    Minutes.Very.Active = as.numeric(gsub(",", "", Minutes.Very.Active)),
    Activity.Calories = as.numeric(gsub(",", "", Activity.Calories))
)

sleep <- sleep %>% mutate(
    Start.Time = ymd_hm(Start.Time), 
    End.Time = ymd_hm(End.Time), 
    Minutes.REM.Sleep = as.numeric(gsub("/","", Minutes.REM.Sleep)),
    Minutes.Light.Sleep = as.numeric(gsub("/","", Minutes.Light.Sleep)),
    Minutes.Deep.Sleep = as.numeric(gsub("/","", Minutes.Deep.Sleep)),
    Date = date(End.Time)
    
)
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion
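The warnings above mean that some strings could not be converted to numbers and were turned into NA. A quick way to see which raw values are responsible is to compare the text with the parsed result. This is only a sketch: the "unknown" entry is a made-up placeholder, not a value from my actual data.

```r
# Toy column; "unknown" is a hypothetical stand-in for whatever non-numeric
# text appears in the real file
raw <- c("70", "59", "unknown")

# Convert, silencing the coercion warning we already know about
parsed <- suppressWarnings(as.numeric(raw))

# Positions where the text was present but the conversion failed
failed <- which(is.na(parsed) & !is.na(raw))

# Show the offending raw strings
print(raw[failed])
```

Running the same check on each sleep column before overwriting it would show exactly which entries became NA.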

If you look at the sleep data, some rows have 0 minutes of sleep recorded, or less than two hours. Let’s assume these are nights when I charged my Fitbit overnight and didn’t record my sleep (anything under two hours could be a nap). We want to drop these records because they would distort any model we fit.

sleep <- sleep %>% filter(Minutes.Asleep > 120)

Similarly, in the activity data set, a day with 0 steps probably means the data hadn’t synced yet or I didn’t wear my Fitbit that day.

activity <- activity %>% filter(Steps > 0)

I want to use my sleep data together with my activity data, so I need to merge them on Date. An inner join keeps only the dates that appear in both data sets.

fitbit <- inner_join(sleep, activity, by = "Date")
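Because an inner join silently drops unmatched rows, it can be worth checking what was lost. Below is a small sketch with made-up toy data frames (sleep_toy and activity_toy are hypothetical names, and the values are invented for illustration). anti_join() from dplyr returns the rows of the first table that have no match in the second.

```r
library(dplyr)

# Toy data: sleep was recorded on two days, activity on three
sleep_toy <- data.frame(
  Date = as.Date(c("2019-09-01", "2019-09-02")),
  Minutes.Asleep = c(358, 416)
)
activity_toy <- data.frame(
  Date = as.Date(c("2019-09-01", "2019-09-02", "2019-09-03")),
  Steps = c(2774, 9983, 6033)
)

# Keeps only the dates present in both tables
joined <- inner_join(sleep_toy, activity_toy, by = "Date")

# Activity rows that have no matching sleep record
dropped <- anti_join(activity_toy, sleep_toy, by = "Date")

print(joined)
print(dropped)
```

On the real data, the same anti_join() call with activity and sleep would list the days that did not make it into fitbit.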

Model Training

Now I have a clean (although small) data set. Let’s build a simple model that predicts calories burned from some of the other variables. There’s not much you can do with a data set this small, but let’s try something anyway to learn!

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2

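# Set the seed before the split so the train/test partition is reproducible too
set.seed(825)
inTraining <- createDataPartition(fitbit$Calories.Burned, p = .80, list = FALSE)
training <- fitbit[ inTraining,]
testing  <- fitbit[-inTraining,]

# Reset the seed so the bootstrap resampling in train() is reproducible
set.seed(825)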
bayesFit <- train(Calories.Burned ~ Minutes.Asleep + Minutes.Deep.Sleep + Steps + Minutes.Very.Active, data = training, 
                 method = "bayesglm")
bayesFit
## Bayesian Generalized Linear Model 
## 
## 16 samples
##  4 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 16, 16, 16, 16, 16, 16, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   139.6299  0.894614  114.9371

Let’s check how our model performs on the test set, which here is a single day:

paste('Actual: ', testing$Calories.Burned)
## [1] "Actual:  2060"
paste('Predicted: ', predict(bayesFit, newdata = testing))
## [1] "Predicted:  2080.36669421062"
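To put a number on how close that is, here is a small sketch that computes the error for the prediction above. The helper functions rmse() and mae() are my own illustrative definitions, not part of caret. With only one test row the two metrics are identical, so this is a sanity check rather than a real evaluation.

```r
# Values copied from the output above
actual    <- 2060
predicted <- 2080.36669421062

# Absolute and percentage error for the single test day
abs_error <- abs(predicted - actual)
pct_error <- 100 * abs_error / actual

# Illustrative helpers for when the test set has more than one row
rmse <- function(a, p) sqrt(mean((a - p)^2))
mae  <- function(a, p) mean(abs(a - p))

print(round(abs_error, 2))
print(round(pct_error, 2))
print(c(RMSE = rmse(actual, predicted), MAE = mae(actual, predicted)))
```

An error of about 20 calories on one day is encouraging, but a single observation cannot tell us how the model generalizes.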

Intermediate Visualization

Look how annoying 3D visualizations are to manipulate :p

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

plot_ly(data = fitbit, 
        x = ~Date, 
        y = ~Minutes.Very.Active,
        z = ~Calories.Burned)%>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Date'),
                     yaxis = list(title = 'Minutes Active'),
                     zaxis = list(title = 'Calories Burned')))

Let’s do better by using size, colour, etc. to keep track of the extra dimensions.

library(plotly)
library(broom)

m <- loess(Calories.Burned ~ Minutes.Very.Active, data = fitbit)

fitbit %>% 
  plot_ly(x = ~Minutes.Very.Active)%>%
  add_markers(y = ~Calories.Burned, size = ~Steps, text = fitbit$Date, showlegend = FALSE, name = "Day") %>%
  add_lines(y = ~fitted(loess(Calories.Burned ~ Minutes.Very.Active)),
            line = list(color = '#07A4B5'),
            name = "Loess Smoother", showlegend = TRUE)  %>%
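  # Assumption: newer broom releases only return .se.fit when augment() is
  # called with se_fit = TRUE; older releases returned it by default.
  add_ribbons(data = augment(m),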
              ymin = ~.fitted - 1.96 * .se.fit,
              ymax = ~.fitted + 1.96 * .se.fit,
              line = list(color = 'rgba(7, 164, 181, 0.05)'),
              fillcolor = 'rgba(7, 164, 181, 0.2)',
              name = "Standard Error")   %>%
  layout(xaxis = list(title = 'Minutes Very Active'),
         yaxis = list(title = 'Calories Burned'),
         legend = list(x = 0.80, y = 0.20))
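If you are curious where the ribbon comes from, the same band can be computed with base R alone. This is a sketch under assumptions: it uses the built-in mtcars data as a stand-in for my Fitbit data, and the names m_demo, pred and band are just for illustration. predict() on a loess fit with se = TRUE returns the fitted values and their standard errors, and fit ± 1.96 × standard error gives an approximate 95% band.

```r
# Stand-in data: the built-in mtcars set instead of the Fitbit data
m_demo <- loess(mpg ~ disp, data = mtcars)

# Fitted values and standard errors at the observed x values
pred <- predict(m_demo, se = TRUE)

# Approximate 95% band around the smoother
band <- data.frame(
  disp  = mtcars$disp,
  fit   = as.numeric(pred$fit),
  lower = as.numeric(pred$fit - 1.96 * pred$se.fit),
  upper = as.numeric(pred$fit + 1.96 * pred$se.fit)
)

# Sort by x so the band can be drawn left to right
band <- band[order(band$disp), ]

print(head(band))
```

The lower and upper columns play the role of ymin and ymax in the add_ribbons() call above.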

Bonus Highlight: Learn how to make memes with R!

The great thing about open source languages is that anybody can develop a package (including you!). Here’s one of my favourites! https://cran.r-project.org/web/packages/meme/vignettes/meme.html

#install.packages("meme")
library(meme)
## Warning: package 'meme' was built under R version 3.5.3
# The following block registers fonts and is only needed on Windows
if (.Platform$OS.type == "windows") {
    windowsFonts(
        Impact = windowsFont("Impact"),
        Courier = windowsFont("Courier")
    )
}

u <- system.file("success.jpg", package="meme")
myMeme <- meme(u, "went to an R Workshop","learned how to make memes with code")
myMeme

meme_save(myMeme, file="successR.png")